Discovering General Prominent Streaks in Sequence Data
Abstract
This paper studies the problem of prominent streak discovery in sequence data. Given a sequence of values, a prominent streak is a long consecutive subsequence consisting of only large (or only small) values, e.g., consecutive games of outstanding performance in sports, consecutive hours of heavy network traffic, or consecutive days of frequent mentions of a person in social media. Prominent streak discovery provides insightful data patterns for data analysis in many real-world applications and is an enabling technique for computational journalism. Given its real-world usefulness and complexity, research on prominent streaks in sequence data opens a spectrum of challenging problems. A baseline approach to finding prominent streaks is a quadratic algorithm that exhaustively enumerates all possible streaks and performs pairwise streak dominance comparison. For more efficient methods, we observe that prominent streaks are in fact skyline points in two dimensions: streak interval length and minimum value in the interval. Our solution thus hinges upon separating the two steps in prominent streak discovery: candidate streak generation and the skyline operation over candidate streaks. For candidate generation, we propose the concept of a local prominent streak (LPS). We prove that prominent streaks are a subset of LPSs and that the number of LPSs is less than the length of the data sequence, in contrast to the quadratic number of candidates produced by the brute-force baseline method. We develop efficient algorithms based on the concept of LPS. The non-linear LPS-based method (NLPS) considers a superset of LPSs as candidates, and the linear LPS-based method (LLPS) is further guaranteed to consider only LPSs. The proposed properties and algorithms are also extended to discovering general top-k, multi-sequence, and multi-dimensional prominent streaks. Experiments on multiple real datasets verified the effectiveness of the proposed methods and showed orders-of-magnitude performance improvement over the baseline method.
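The brute-force baseline and the skyline view described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the tuple layout, and the exact dominance rule (at least as long and at least as high a minimum, strictly better in one of the two) are assumptions made here for illustration, and the pairwise comparison is left unoptimized.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: (left index, right index, length, minimum value), 0-based.
Streak = Tuple[int, int, int, float]


def all_streaks(values: Sequence[float]) -> List[Streak]:
    """Enumerate every contiguous interval with its length and minimum value."""
    streaks: List[Streak] = []
    for left in range(len(values)):
        current_min = values[left]
        for right in range(left, len(values)):
            # Extend the interval by one element and update its minimum.
            current_min = min(current_min, values[right])
            streaks.append((left, right, right - left + 1, current_min))
    return streaks


def dominates(a: Streak, b: Streak) -> bool:
    """Assumed skyline dominance: a is no worse in both dimensions, better in one."""
    no_worse = a[2] >= b[2] and a[3] >= b[3]
    strictly_better = a[2] > b[2] or a[3] > b[3]
    return no_worse and strictly_better


def prominent_streaks_bruteforce(values: Sequence[float]) -> List[Streak]:
    """Keep the streaks that no other streak dominates (the 2-D skyline)."""
    candidates = all_streaks(values)
    return [
        s for s in candidates
        if not any(dominates(t, s) for t in candidates)
    ]


if __name__ == "__main__":
    for streak in prominent_streaks_bruteforce([1, 3, 3, 2]):
        print(streak)
```

For the sequence `[1, 3, 3, 2]`, the sketch keeps three streaks: the whole sequence (length 4, minimum 1), the last three values (length 3, minimum 2), and the two middle values (length 2, minimum 3). Single-element streaks with value 3 are dropped because the length-2 streak with the same minimum dominates them. The quadratic number of candidates enumerated here is what the LPS-based methods in the abstract aim to avoid.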